Collaborator

@sjberman sjberman commented Apr 23, 2025

To get up to speed quickly with the Gateway API, its implementation, and its alignment with NGINX as a data plane, we initially settled on a simplified but rigid deployment pattern. To improve our security posture and installation flexibility, the control and data planes are now being separated into semi-autonomous, distributed components. This also allows a single control plane to support multiple Gateways.

A general summary of the changes being made:

  • control plane and data plane are now in separate Deployments
  • installing NGF just installs the control plane
  • when a Gateway resource is created, the control plane provisions an nginx data plane deployment and service
  • the NginxProxy CRD resource can now be set at the Gateway level, and has been enhanced to include all deployment/service infrastructure-related fields, such as replicas, loadBalancerIP, serviceType, etc.
    • these fields can be configured globally at installation time in the helm chart, or set on an individual basis per Gateway
    • updating these fields directly on a provisioned nginx Deployment or Service will not take effect
    • this does not apply to the control plane Deployment
  • labels/annotations for the NGINX deployment or service can be set in the Gateway's Infrastructure section
  • the NGINX pod uses the NGINX agent (currently an unofficial, unreleased version) to update NGINX configuration
  • control plane communicates with the NGINX agent over a secure gRPC connection, using self-signed certs by default, created at installation time. Cert-manager can be used instead.
  • multiple Gateways are now supported

Design: https://github.com/nginx/nginx-gateway-fabric/tree/main/docs/proposals/control-data-plane-split
Epic: #1508

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

BREAKING CHANGES:

<link to upgrade documentation and anything else that may be relevant>

The following changes are breaking and require users to fully uninstall NGINX Gateway Fabric (including NGINX Gateway Fabric CRDs) before re-installing the new version. Gateway API resources (such as Gateway, HTTPRoute, etc.) are unaffected and can be left alone.

- Control plane and data plane have been separated into different Deployments.
   - the control plane will provision an NGINX data plane Deployment and Service when a Gateway object is created.
- NginxProxy CRD resource is now namespace-scoped (was cluster-scoped).
- NginxProxy resource controls infrastructure fields for the NGINX Deployment and Service, such as replicas, loadBalancerIP, serviceType, etc. Users who want to set or update these fields must do so either at installation time through the helm chart (which sets them globally), or per Gateway. Updating these fields directly on a provisioned nginx Deployment or Service will not take effect.
   - this does not apply to the NGINX Gateway Fabric control plane Deployment.
- Helm values structure has changed slightly to better support the separate Deployments.

FEATURES:
- Support for creating and deploying multiple Gateways.
- NginxProxy resource can now additionally be attached to a Gateway. For that Gateway, its settings override any settings attached at the GatewayClass level.

@github-actions github-actions bot added documentation Improvements or additions to documentation dependencies Pull requests that update a dependency file change Pull requests that introduce a change helm-chart Relates to helm chart labels Apr 23, 2025
codecov bot commented Apr 23, 2025

Codecov Report

Attention: Patch coverage is 74.13074% with 372 lines in your changes missing coverage. Please review.

Project coverage is 86.73%. Comparing base (7bce264) to head (a66267b).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/mode/static/manager.go | 3.73% | 103 Missing ⚠️ |
| internal/mode/static/nginx/agent/command.go | 81.11% | 57 Missing and 11 partials ⚠️ |
| cmd/gateway/commands.go | 71.73% | 50 Missing and 2 partials ⚠️ |
| internal/mode/static/handler.go | 77.01% | 32 Missing and 8 partials ⚠️ |
| cmd/gateway/certs.go | 75.67% | 25 Missing and 11 partials ⚠️ |
| internal/framework/controller/predicate/secret.go | 56.75% | 12 Missing and 4 partials ⚠️ |
| internal/framework/file/file.go | 77.58% | 12 Missing and 1 partial ⚠️ |
| internal/mode/static/nginx/agent/action.go | 83.33% | 8 Missing and 2 partials ⚠️ |
| internal/framework/controller/resource.go | 0.00% | 7 Missing ⚠️ |
| ...ernal/framework/controller/predicate/annotation.go | 76.00% | 4 Missing and 2 partials ⚠️ |
... and 7 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3318      +/-   ##
==========================================
+ Coverage   86.20%   86.73%   +0.52%     
==========================================
  Files         116      129      +13     
  Lines       11928    14862    +2934     
  Branches       62       62              
==========================================
+ Hits        10283    12891    +2608     
- Misses       1580     1822     +242     
- Partials       65      149      +84     


sjberman and others added 18 commits May 14, 2025 10:43
Removing the nginx runtime manager and deployment container since nginx will live in its own pod managed by agent. Temporarily saving the nginx deployment and service for future use.

Updated the control plane liveness probe to return true once it's processed all resources, instead of after it's written config to nginx (since nginx may not be started yet in the future architecture).
Updating the nginx docker containers to build and include agent. Once agent is officially released, we can use the published binary instead of building.

Added a temporary nginx deployment to the helm chart to deploy a standalone nginx pod.

Added the basic gRPC server and agent API implementation to allow for the agent pod to connect to the control plane without errors.
Added the following:
- middleware to extract IP address of agent and store it in the grpc context
- link the agent's hostname to its IP address when connecting and track it
- use this linkage to pause the Subscription until the agent registers itself, then proceed

This logic is subject to change as we enhance this (like tracking auth token instead of IP address).
Problem: When the control plane and data planes are split, the user will need the ability to specify data plane settings on a per-Gateway basis. To allow this, we need to support NginxProxy at the Gateway level in addition to the GatewayClass level. In practice, this means a user can reference an NginxProxy resource via the spec.infrastructure.parametersRef field on the Gateway resource. We still want to support referencing an NginxProxy at the GatewayClass level. If a Gateway and its GatewayClass reference distinct NginxProxy resources, the settings must be merged. Settings specified on a Gateway NginxProxy must override those set on the GatewayClass NginxProxy.

Solution: To support NginxProxy at the Gateway level several changes were made to the API.
As a result, the API is now at version v1alpha2.

Breaking Changes:
* Change the scope of the CRD to Namespaced. The parametersRef.namespace field on the GatewayClass is now required.
* Make DisableHTTP2 and Telemetry.Exporter.Endpoint optional.

New fields:
* Telemetry.DisabledFeatures: allows users to explicitly disable telemetry features. It is a list with one supported entry: DisableTracing. More features may be added in future releases.

Other changes:
* Remove the listType=Map kubebuilder annotation from the RewriteClientIP.TrustedAddresses field. This listType is incorrect since TrustedAddresses can have duplicate keys.

The graph now stores NginxProxies that are referenced by the winning GatewayClass and Gateway. This will need to be updated once we support multiple Gateways. The graph is also responsible for merging the NginxProxies when necessary. The result of this is stored on the graph's Gateway object in the field EffectiveNginxProxy. The EffectiveNginxProxy on the Gateway is used to build the NGINX configuration.
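
The merge rule above (Gateway-level settings override GatewayClass-level ones, and unset fields fall through) can be illustrated with a small sketch. `ProxySpec` is a hypothetical, heavily simplified stand-in for the real NginxProxy spec, and `mergeProxySpecs` is an assumed helper name, not the function used in the graph.

```go
package main

import "fmt"

// ProxySpec is a hypothetical stand-in for the NginxProxy spec. Pointer
// fields distinguish "unset" from an explicitly set zero value.
type ProxySpec struct {
	Replicas     *int32
	ServiceType  *string
	DisableHTTP2 *bool
}

// mergeProxySpecs returns the effective settings: values set at the Gateway
// level win, and unset fields fall back to the GatewayClass level.
func mergeProxySpecs(gatewayClass, gateway *ProxySpec) *ProxySpec {
	if gatewayClass == nil && gateway == nil {
		return nil
	}
	effective := ProxySpec{}
	if gatewayClass != nil {
		effective = *gatewayClass
	}
	if gateway == nil {
		return &effective
	}
	if gateway.Replicas != nil {
		effective.Replicas = gateway.Replicas
	}
	if gateway.ServiceType != nil {
		effective.ServiceType = gateway.ServiceType
	}
	if gateway.DisableHTTP2 != nil {
		effective.DisableHTTP2 = gateway.DisableHTTP2
	}
	return &effective
}

// ptr returns a pointer to v; handy for building specs in examples.
func ptr[T any](v T) *T { return &v }

func main() {
	class := &ProxySpec{Replicas: ptr(int32(2)), ServiceType: ptr("LoadBalancer")}
	gw := &ProxySpec{Replicas: ptr(int32(5))}

	eff := mergeProxySpecs(class, gw)
	// Replicas comes from the Gateway, ServiceType from the GatewayClass,
	// and DisableHTTP2 stays unset.
	fmt.Println(*eff.Replicas, *eff.ServiceType, eff.DisableHTTP2 == nil)
	// prints: 5 LoadBalancer true
}
```
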
This commit adds functionality to send nginx configuration to the agent. It also adds support for the single nginx Deployment to be scaled, and send configuration to all replicas. This requires tracking all Subscriptions for a particular deployment, and receiving all responses from those replicas to determine the status to write to the Gateway.
Problem: The NGINX Plus API conf file was empty when sending using OSS, which caused an error applying config. This also revealed an issue where we received multiple messages from agent, causing some channel blocking.

Solution: Don't send the empty NGINX conf file if not running N+. Ignore responses from agent about rollbacks, so we only ever process a single response as expected.
Add leader election so that data plane pods only connect to the lead NGF pod. If the control plane is scaled, only the leader is marked as ready and the backups are marked Unready, so the data plane doesn't connect to them.

Problem: We want the NGF control plane to fail-over to another pod when the control plane pod goes down.

Solution: Only the leader pod is marked as ready by Kubernetes, so all data plane pods connect to the leader pod.
This commit updates the control plane to deploy an NGINX data plane when a valid Gateway resource is created. When the Gateway is deleted or becomes invalid, the data plane is removed. The NginxProxy resource has been updated with numerous configuration options related to the k8s deployment and service configs, which the control plane will apply to the NGINX resources when set. The control plane fully owns the NGINX deployment resources, so users who want to change any configuration must do so using the NginxProxy resource.

This does not yet support NGINX Plus or NGINX debug mode. Those will be added in follow-up pull requests. This also adds some basic DaemonSet fields, but does not yet support deploying a DaemonSet. That will also be added soon.
* Add back runnables change and call to nginx provisioner enable

---------

Co-authored-by: Benjamin Jee <[email protected]>
…3147)

Support nginx debug mode when provisioning the Data Plane.

Problem: We want to have the option to provision nginx instances in debug mode.

Solution: Add a debug field to the NginxProxy CRD. Users can also set the debug field when installing through Helm by setting the nginx.debug flag.
Continuation from the previous commit to add support for provisioning with NGINX Plus. This adds support for duplicating any NGINX Plus or docker registry secrets into the Gateway namespace.

Added unit tests.
With the new deployment model, the provisioner mode for conformance tests is no longer needed. This code is removed, and at a later date the conformance tests will be updated to work with the new model. Renamed the "static-mode" to "controller".

Also removed some unneeded metrics collection.
Problem: When a user updates or deletes their docker registry or NGINX Plus secrets, those changes need to be propagated to all duplicate secrets that we've provisioned for the Gateway resources.

Solution: If updated, update the provisioned secret. If deleted, delete the provisioned secret.
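
The update/delete propagation can be sketched with an in-memory map standing in for the Kubernetes API. `secretRef`, `secretStore`, and `propagate` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// secretRef identifies a Secret by namespace and name.
type secretRef struct{ Namespace, Name string }

// secretStore is a hypothetical in-memory stand-in for the Kubernetes API.
type secretStore map[secretRef][]byte

// propagate applies a change to the source Secret to every duplicate:
// an update overwrites each duplicate's data, a delete removes each duplicate.
func propagate(store secretStore, duplicates []secretRef, data []byte, deleted bool) {
	for _, ref := range duplicates {
		if deleted {
			delete(store, ref)
			continue
		}
		// Copy so that duplicates do not share a backing array with the source.
		store[ref] = append([]byte{}, data...)
	}
}

func main() {
	store := secretStore{}
	dups := []secretRef{{"gw-a", "registry-creds"}, {"gw-b", "registry-creds"}}

	propagate(store, dups, []byte("token-v2"), false)
	fmt.Println(len(store), string(store[dups[0]])) // prints: 2 token-v2

	propagate(store, dups, nil, true)
	fmt.Println(len(store)) // prints: 0
}
```
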
Update functional tests for the control plane data plane split.

Problem: The functional tests do not pass with the current architecture.

Solution: Add updates to functional tests.
Problem: We want to ensure that the connection between the control plane and data plane is authenticated and secure.

Solution:

1. Configure agent to send the Kubernetes ServiceAccount token in the request. The control plane validates this token using the TokenReview API to ensure the agent is authenticated.
2. Configure TLS certificates for both the control and data planes. By default, a Job will run when installing NGF that creates self-signed certificates in the nginx-gateway namespace. The server Secret is mounted to the control plane, and the control plane copies the client Secret when deploying nginx resources. This Secret is mounted to the agent.

The control plane will reset the agent connection if it detects that its own certs have changed.

For production environments, we recommend that users configure TLS using cert-manager instead, for better security and certificate rotation.
Problem: The data plane container was not properly handling the kill signal when the Pod was Terminated.

Solution: Update the entrypoint to catch the proper signals.
sjberman and others added 14 commits May 14, 2025 10:43
Problem: Now that we have additional pods in the new architecture, we need the proper SecurityContextConstraints for running in OpenShift.

Solution: Create an SCC for the cert-generator and an SCC for nginx data plane pods on startup. A Role and RoleBinding are created when deploying nginx to link to the SCC.
Problem: Users want to be able to configure multiple Gateways with a single installation of NGF.

Solution: Support the ability to create multiple Gateways. Routes and policies can be attached to multiple Gateways.

Also fixed conformance tests.

---------

Co-authored-by: Saylor Berman <[email protected]>
Update non-functional tests for the control plane data plane split.

Problem: The non-functional tests do not work for the control plane data plane split changes.

Solution: Update non-functional tests.

Testing: Scale, Reconfiguration, Performance, and Longevity tests work. The Upgrade test doesn't work, but this is expected: the CP/DP split is a breaking change to NGF, so a zero-downtime upgrade isn't readily possible.

---------

Co-authored-by: Saylor Berman <[email protected]>
…ervice (#3319)

Add ability to set loadBalancerClass for load balancer Service

Problem: We would like the ability to specify the loadBalancerClass field on a load balancer Service.

Solution: Add ability to set loadBalancerClass for load balancer Service.

Testing: Manually tested that deploying NGF with the nginx.service.loadBalancerClass Helm flag would correctly set the field. Also tested that modifying the NginxProxy resource would set the loadBalancerClass when the service was re-created (the field can only be set upon creation).
Problem: All config update events resulted in sending configuration to every Gateway, even if the change was irrelevant.

Solution: Compare new config with old config to determine if a ConfigApply is necessary. Simplified the change processor and handler to no longer have to determine this.
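
A simple form of this comparison is to diff the generated files by path and content before deciding whether to send anything. The `File` type and `configChanged` function below are hypothetical and assume file paths are unique within a configuration. The actual comparison in NGF may work differently.

```go
package main

import (
	"bytes"
	"fmt"
)

// File is a hypothetical representation of one generated nginx config file.
type File struct {
	Path    string
	Content []byte
}

// configChanged reports whether next differs from prev in any path or
// content. Paths are assumed to be unique within each slice.
func configChanged(prev, next []File) bool {
	if len(prev) != len(next) {
		return true
	}
	old := make(map[string][]byte, len(prev))
	for _, f := range prev {
		old[f.Path] = f.Content
	}
	for _, f := range next {
		content, ok := old[f.Path]
		if !ok || !bytes.Equal(content, f.Content) {
			return true
		}
	}
	return false
}

func main() {
	prev := []File{{Path: "/etc/nginx/conf.d/http.conf", Content: []byte("server { listen 80; }")}}
	same := []File{{Path: "/etc/nginx/conf.d/http.conf", Content: []byte("server { listen 80; }")}}
	next := []File{{Path: "/etc/nginx/conf.d/http.conf", Content: []byte("server { listen 8080; }")}}

	fmt.Println(configChanged(prev, same)) // prints: false (skip the apply)
	fmt.Println(configChanged(prev, next)) // prints: true (send the config)
}
```
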
The prometheus logger is no longer needed since we don't collect nginx metrics in the control plane anymore.

Also updated agent dependencies to fix the broken build.
Problem: Now that the control plane provisions the NGINX Service, users can't set specific NodePort values.

Solution: Allow users to specify NodePorts in the helm chart (globally) and in the NginxProxy resource.
Update documentation on accessing nginx container

Problem: With the incoming control plane and data plane split, the nginx container will no longer be in the NGF Pod. Thus all documentation on accessing the nginx container (logs, sending traffic, config, etc.) needs to be updated.

Solution: Updated the documentation. Mainly, when sending traffic in the examples, the host and IP of the NGINX Service are recorded after the Gateway is deployed. Most of these changes are in our examples.
Collect telemetry for number of data plane pods, control plane pods, and nginx proxy resources attached to a gateway.

Problem: With the control data plane split, there are a few telemetry metrics which need to be updated or added.

Solution: Update/Add telemetry metrics.
Problem: Updating labels/annotations on the Gateway did not propagate to some resources.

Solution: Ensure that labels/annotations are set when updating resources.
* Remove unused service annotations

* Remove files for aws-nlb that rely on service annotations
@sjberman sjberman force-pushed the change/control-data-plane-split branch from f659def to a66267b Compare May 14, 2025 16:43
@sjberman sjberman marked this pull request as ready for review May 14, 2025 16:44
@sjberman sjberman requested review from a team as code owners May 14, 2025 16:44
Contributor

@ciarams87 ciarams87 left a comment

🔥

@sjberman sjberman merged commit 621ec00 into main May 15, 2025
46 of 47 checks passed
@sjberman sjberman deleted the change/control-data-plane-split branch May 15, 2025 13:28
@github-project-automation github-project-automation bot moved this from 🆕 New to ✅ Done in NGINX Gateway Fabric May 15, 2025

Labels

change Pull requests that introduce a change dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation helm-chart Relates to helm chart release-notes

Projects

Status: Done

5 participants